GIFs (Graphics Interchange Format images) are frequently used to reply to posts on social media platforms, but many existing approaches to the question “how to choose an appropriate GIF to reply to a post” make little use of the GIF tag information available on social media. To address this, a Multi-Modal Dialog reply retrieval approach based on Contrastive learning and GIF Tags (CoTa-MMD) was proposed, which integrates tag information into the retrieval process. Specifically, tags were used as intermediate variables, so that text-to-GIF retrieval was converted into text-to-tag-to-GIF retrieval. Modal representations were then learned with a contrastive learning algorithm, and the retrieval probability was calculated using the law of total probability. Compared with direct text-image retrieval, the introduction of intermediate tags reduced the retrieval difficulty caused by the heterogeneity of different modalities. Experimental results show that, compared with the DSCMR (Deep Supervised Cross-Modal Retrieval) model, CoTa-MMD improved the recall sum of the text-image retrieval task by 0.33 percentage points on the PEPE-56 multimodal dialogue dataset and by 4.21 percentage points on the Taiwan multimodal dialogue dataset.
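The tag-mediated retrieval step can be illustrated with a minimal sketch. It assumes that text-to-tag and tag-to-GIF similarity scores are already available (for example, from contrastively trained encoders) and that each is turned into a conditional distribution with a softmax; the abstract does not specify these details, so the function names, the softmax normalization, and the toy scores below are hypothetical. The text-to-GIF probability is then obtained by marginalizing over tags, P(GIF | text) = Σ_tag P(GIF | tag) · P(tag | text).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_to_gif_probs(text_tag_scores, tag_gif_scores):
    """Marginalize over tags (hypothetical helper, not the paper's code).

    text_tag_scores: (n_texts, n_tags) similarity scores
    tag_gif_scores:  (n_tags, n_gifs) similarity scores
    returns:         (n_texts, n_gifs) matrix of P(GIF | text)
    """
    p_tag_given_text = softmax(text_tag_scores, axis=-1)  # rows sum to 1
    p_gif_given_tag = softmax(tag_gif_scores, axis=-1)    # rows sum to 1
    # Law of total probability, written as a matrix product.
    return p_tag_given_text @ p_gif_given_tag

# Toy scores (made up for illustration): 1 text, 3 tags, 4 GIFs.
text_tag_scores = np.array([[2.0, 0.1, -1.0]])
tag_gif_scores = np.array([
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

probs = text_to_gif_probs(text_tag_scores, tag_gif_scores)
ranking = np.argsort(-probs, axis=-1)  # GIF indices, most probable first
```

Because both factors are row-normalized, each row of `probs` is itself a valid distribution over GIFs, so candidate GIFs can be ranked directly by these values.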